Genome partitioning

ABSTRACT

This invention relates to ‘genome partitioning’ and nucleic acid library construction, for example for sequence variation discovery and screening. The method employs a plurality of restriction enzymes in order to reliably reproduce a representative partition of the entirety of a sample nucleic acid based on the restriction ends of one or more ‘layers’ of the fragments present. In preferred embodiments there is provided a method for producing a nucleic acid library, which library contains a plurality of different nucleic acid fragments, the method comprising: (i) digesting the sample nucleic acid with a plurality of different restriction enzymes to generate a plurality of different layers of fragments, wherein each layer is a group of fragments having a unique combination of restriction ends, and wherein the combination of layers represents the entirety of the sample nucleic acid, (ii) optionally purifying said fragments, (iii) selecting a desired sub-set of layers according to the unique restriction ends of said layers, (iv) ligating said sub-set of layers into vectors adapted to receive it, (v) transforming host cells with the vectors, (vi) culturing said host cells to provide said library containing said partition of the sample nucleic acid. The invention also provides systems, methods and functions for designing and optimising such libraries, and genotyping ‘chips’ based on the genome partitioning methods.

TECHNICAL FIELD

This invention relates generally to nucleic acid library construction, for example for sequence variation discovery and screening. In particular, it relates to methods and materials for reproducibly cloning a subset of a sample nucleic acid having reduced complexity.

BACKGROUND ART

Genetic markers are of increasing importance in the genomics and proteomics fields in understanding phenotype, susceptibility to disease, and response to treatments.

Single nucleotide polymorphisms (SNPs) are one of the most abundant and useful markers, and are the subject of investigation in numerous different organisms, including within the human genome. Methods which have been used in the art have included shotgun sequencing the whole genome or sequencing PCR products (see e.g. Roth (2001) Nature Biotechnology 19: 209-211). Thus shotgun sequencing of the whole human genome provided a few million SNPs from five different individuals as a by-product¹ of the main initiative. A more routine method is to design a pair of specific primers for each DNA fragment of interest. After PCR amplification, the fragment can be purified and sequenced. Although these are widely used methods, their efficiency and throughput are very limited. Moreover, both of them are very costly.

Unfortunately the size of eukaryotic genomes makes it difficult to search or screen for DNA sequence variation between individuals. To address this problem, attempts have been made to reduce the complexity of the genome to a more manageable scale, and thereby facilitate marker discovery.

AFLP is one method of achieving this. It has been widely used to study DNA polymorphisms and AFLP markers have been mapped in many species². However, AFLP has not been used for SNP screening because of its technical limitations, such as artificial sequence alteration, a high proportion of random fragment loss and the complexity of the procedure.

More recently, a more targeted and collaborative effort has been made to reduce genome complexity when searching for human SNPs.

This technology was called the reduced representation shotgun (RRS) strategy and it was adopted for the global human SNP consortium project. RRS reduced the complexity of the genome by about six-fold, which increased the efficiency of SNP discovery. For RRS, the DNA is digested with a restriction enzyme. Based on the distribution of the fragments at different sizes, a subset of the fragments can be cut out from an electrophoresis gel so that the subset only contains the fragments within a particular size interval. The isolated fragments are subsequently cloned into a library for random sequencing³ (see Roth (2001) Nature Biotechnology 19: 209-211).

EP 1001037 (Whitehead Biomedical Inst., US) describes such an RRS strategy. A nucleic acid-containing sample to be assessed is treated to fractionate it into fragments selected in a sequence-dependent manner, a subset of which is selected on the basis of size.

The drawback of this method is that it can only reduce the genome complexity by a small factor.

Thus it can be seen that alternative methods of reproducibly reducing the complexity of nucleic acid samples to a controllable scale, e.g. for marker discovery, would provide a contribution to the art.

DISCLOSURE OF THE INVENTION

The present inventors have developed methods to reduce the complexity of a sample of nucleic acid (e.g. genomic or cDNA library) in large, flexible and controllable scales by dividing the genome or a collection of cDNA into smaller subsets. Briefly, the method uses multiple restriction enzymes to cut the DNA into a collection of restriction fragments. Based on the unique restriction ends of the fragments, they are then divided into different groups or “layers”. A layer, or a combination of layers, is then cloned at a specific restriction site such that the resulting library only contains the desired subset or partition of the total sample. This permits the reduction of e.g. a genomic library's complexity more than a thousand-fold. By treating each sample (or pooled samples) in this way, a highly consistent sub-set of corresponding fragments is generated in each case. Thus the method has particular utility for sequence variation discovery or screening through direct sequencing. Additionally it can be utilised within automated systems to provide high-throughput screening.

Thus in a first aspect there is provided a method for producing a nucleic acid library, which library contains a plurality of different nucleic acid fragments, the combination of said fragments being a representative partition of the entirety of a sample nucleic acid, the method comprising:

-   (i) digesting the sample nucleic acid with a plurality of different restriction enzymes to generate a plurality of different layers of fragments,
    -   wherein each layer is a group of fragments having a unique combination of restriction ends,
    -   and wherein the combination of layers represents the entirety of the sample nucleic acid,
-   (ii) optionally purifying said fragments,
-   (iii) selecting a desired sub-set of layers according to the unique restriction ends of said layers,
-   (iv) ligating said sub-set of layers into vectors adapted to receive it,
-   (v) transforming host cells with the vectors,
-   (vi) culturing said host cells to provide said library containing said partition of the sample nucleic acid.

Thus the method provides a reproducible way of reducing the complexity of the sample. By selection of the appropriate number of restriction enzymes, the type of restriction enzymes, and the sub-set of layers ligated into said vectors, a partition with at least 10-, 100-, or 1000-fold reduced complexity compared to the sample nucleic acid can be generated.

In preferred embodiments, the method is performed (including, optionally, purification to remove short sequences e.g. less than 100 bp) such that the sub-set of layers ligated into said vectors provides a library with fragments with a size range of 100-2000 bp.

The number of restriction enzymes, the type of restriction enzymes, and the sub-set of layers ligated into said vectors are selected in accordance with the equations set out hereinafter.

Choice of Nucleic Acid Sample

Nucleic acid for use in the present invention may include cDNA, RNA and genomic DNA. It may be provided in amplified form. RNA may be provided as cDNA.

Generally speaking, for cDNA samples, the total size of the cDNA pool will be smaller than a genome. Therefore, fewer enzymes will be used and pilot tests (see below) can be used to optimise the design.

The sample may represent all or part of a particular source of origin, e.g. it may have been enriched.

Nucleic acids for use in the present invention may be provided isolated and/or purified from their natural environment, in substantially pure or homogeneous form, or free or substantially free of other nucleic acids of the species of origin. Where used herein, the term “isolated” encompasses all of these possibilities.

Choice of Restriction Enzymes

In preferred embodiments, between 3 and 6 restriction enzymes will be used e.g. equal to, or at least, 3, 4, 5 or 6.

Preferably, the restriction enzymes are selected from four-, six- or eight-base-cutters.

Preferably, one or two six-base-cutters (which cut relatively rarely) are used as cloning-end-generators to create the cloning ends for the layer(s) which are selected for cloning. The other restriction enzymes are four-base-cutters (which cut relatively more frequently) and are used, in effect, as fragment-cutters to destroy some or most of the fragments which could otherwise be cloned into the chosen vector. These enzymes therefore serve to reduce the size of the selected layer(s). A combination of four- and six-base cutters as fragment cutters may be useful to ‘hone’ the size of the partition.

Preferred restriction enzymes are selected from any of those given in Table 1. Eight-base cutters include SfiI and NotI. More preferably the enzymes HpaII, AluI, DraI, and PstI are used (PstI being used to generate cloning ends).

However those skilled in the art will appreciate that other combinations of enzymes may be selected as appropriate to the specific application in hand. For instance, when all or part of a reference sequence for a sample is known, the enzymes will be selected so as to have a target frequency appropriate to the size of the partition which it is wished to generate. Likewise if it is desired to investigate a particular region of the sample, the enzymes will be selected so as to achieve this.

Preferably the plurality of enzymes are used simultaneously, and are selected so as to be active under comparable conditions to permit this. Optimum conditions for commercially available restriction enzymes are available from the manufacturers.

Restriction by one enzyme may be partial. In such cases it is preferred that the group of fragments in the selected layer have restriction ends created by said partial digestion.

Choice of Layers

In preferred embodiments, the selected sub-set of layers consists of one layer or two layers.

The following represent various preferred embodiments of the invention:

Design of Partitions for Samples with Unknown Sequence and Size

In some embodiments it may be required to generate a partition having a desired number of unique fragments where no reference sequence is available in a genome of unknown size. In this case the present invention may incorporate the performance of a ‘pilot test’ to confirm the validity of the partition design, and optionally to refine it.

A pilot test may be used to measure the size or complexity (number of unique sequences) of a particular partition design. It will also provide information about original genome size and restriction site frequencies. The principle is as follows: when sequencing a library (e.g. a partition) having a given number of colonies, there will be a chance for a particular sequence to be sequenced more than once. This is called the sequence redundancy of the shotgun sequencing strategy. The more colonies sequenced, the more redundancy. The smaller (or less complex) the library, the more redundancy. Thus assessment of sequence redundancy provides information about the size of the partition.

The function is described by this formula:

$F = \frac{n(n - 1)}{\sum_{i} n_{i}(n_{i} - 1)} \pm s$

wherein:

-   F is the size or complexity of the partition,
-   n is the total number of good sequences obtained by sequencing,
-   n_(i) is the number of sequences in the i-th contig,
-   s is the standard error, which represents the statistical error when the sample size is not big enough.

Thus, for example, 500 colonies may be selected from a partition and sequenced. This should give more than 400 good quality sequences. Using these sequences, the complexity of the partition, F, can be calculated. Additionally, the deviation constant for restriction enzymes in the genome can be extrapolated from the sequence results, permitting a honing of the partition design.
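As a minimal sketch of the redundancy calculation, the following Python function evaluates the point estimate F = n(n−1)/Σ n_(i)(n_(i)−1) from a list of contig sizes. The function name, the representation of singletons as contigs of size 1, and the example counts are illustrative assumptions; the standard error term s is not computed because no expression for it is given in the text.

```python
def estimate_partition_size(contig_sizes):
    """Point estimate F = n(n-1) / sum_i n_i(n_i-1).

    contig_sizes: number of good sequences in each contig. A sequence
    seen only once is treated as a contig of size 1 (assumption).
    """
    n = sum(contig_sizes)
    redundancy = sum(n_i * (n_i - 1) for n_i in contig_sizes)
    if redundancy == 0:
        # No sequence was observed twice, so F cannot be estimated.
        raise ValueError("no redundancy observed; sequence more colonies")
    return n * (n - 1) / redundancy


# Hypothetical pilot test: 400 good sequences, of which 80 fall into
# 40 two-member contigs and the remaining 320 are singletons.
contigs = [2] * 40 + [1] * 320
print(estimate_partition_size(contigs))  # 400 * 399 / 80 = 1995.0
```

With these made-up counts the estimate is about 2000 unique fragments; fewer repeated sequences among the same number of reads would give a larger estimate.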

Thus the method may include performing the method of the invention as described above using parameters which are likely to produce an acceptable result for a wide spread of genome sizes from different species, for example by performing a digestion of 5 μg genomic DNA using a 6 nt cutter (e.g. PstI) as the cloning site enzyme and three 4 nt cutters (e.g. HpaII, AluI and DraI). The partition may be cloned into pZErO at the PstI site in the presence of suitable enhancing linkers (linkers for HpaII, AluI and DraI).

The following steps are then performed:

-   (vii) sequencing the fragments in a fraction of the colonies (host cells) in said library,
-   (viii) calculating the size of the library (i.e. partition) using the formula $F = \frac{n(n - 1)}{\sum_{i} n_{i}(n_{i} - 1)} \pm s$.

If the partition size is appropriate it can be accepted.

If not (for example it is too small or too big) then the following further steps, in any appropriate order, may be performed:

-   (ix) providing the restriction site frequency (f_(i)) of the enzymes used in the partition, for example based on sequences obtained at step (vii),
-   (x) calculating the genome size G using the formula:
    $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k = x1}^{x2} \prod_{i} ( 1 - P_{i} )^{k}$
    wherein:
    -   N_(x1˜x2) is the number of fragments with length between x1 and x2 (which is F above),
    -   k is fragment length,
    -   x1 and x2 are the lower and upper limits of the size range of the fragments in the library (these may be assumed as 100 bp and 2000 bp, as described above, or can be verified by the sequence obtained),
    -   P_(i) is the probability of having a restriction site at any given base for the ‘i’th enzyme,
-   (xi) providing a restriction site frequency (f_(i)) for enzymes not used in the partition, for example based on sequences obtained at step (vii) (this can also be expressed as P_(i)),
-   (xii) selecting further restriction enzymes on the basis of restriction site frequency (f_(i)) to generate a desired size of partition using the formula:
    $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k = x1}^{x2} \prod_{i} ( 1 - P_{i} )^{k}$
-   (xiii) producing a further nucleic acid library in accordance with steps (i)-(vi) using at least one of these further restriction enzymes.
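The calculation in steps (x) and (xii) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: P_1 is read as the per-base site probability of the cloning-site enzyme, the product is taken over every enzyme in the digest (cloning-site enzyme included), and the function names and example probabilities are hypothetical.

```python
def size_window_sum(p_all, x1=100, x2=2000):
    """Sum over k = x1..x2 of prod_i (1 - P_i)^k."""
    q = 1.0
    for p in p_all:
        q *= 1.0 - p  # probability that no enzyme cuts at a given base
    return sum(q ** k for k in range(x1, x2 + 1))


def expected_fragments(genome_size, p_cloning, p_all, x1=100, x2=2000):
    """Step (xii): N = G * P_1^2 * sum_k prod_i (1 - P_i)^k."""
    return genome_size * p_cloning ** 2 * size_window_sum(p_all, x1, x2)


def estimate_genome_size(n_fragments, p_cloning, p_all, x1=100, x2=2000):
    """Step (x): the same formula rearranged to give G from a measured N."""
    return n_fragments / (p_cloning ** 2 * size_window_sum(p_all, x1, x2))


# Hypothetical inputs: one 6 nt cloning-site cutter and three 4 nt
# fragment cutters, all with randomly distributed sites (assumption).
p_six = 1 / 4 ** 6
p_four = 1 / 4 ** 4
enzymes = [p_six, p_four, p_four, p_four]
genome = estimate_genome_size(1995, p_six, enzymes)
print(f"estimated genome size: {genome:.3g} bp")
```

In practice the P_(i) values would come from the frequencies measured at step (ix) rather than from the random-distribution assumption used here.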

It should be noted that in reality the probability of an enzyme cutting site being present will vary according to the restriction enzyme in question. Preferably, therefore, where a sample sequence is unknown, P_(i) is measured or estimated in silico based on a large sample of sequences e.g. from a database.

A corresponding approach may be used with cDNA from an unknown tissue from an unknown species. In such a case the lower complexity (compared with a genome) suggests that PstI as the cloning site restriction enzyme, and HpaII as the fragment cutter, may be an appropriate starting point.

Design of Partitions for Samples of Known Size and Unknown Sequence

Where the approximate genome size (G) is known, in choosing the enzymes to be used in step (i), the restriction sites may be assumed to be randomly distributed, i.e. v=1, wherein v is the deviation constant in the formula P=v/256 for a four-base cutter and P=v/4096 for a six-base cutter.

The enzymes to produce a desired partition size are thus selected on the basis of the formula:

$N_{x1 \sim x2} = G P_{1}^{2} \sum_{k = x1}^{x2} \prod_{i} ( 1 - P_{i} )^{k}$

More specifically the formula:

$N^{\prime} = 4^{-12} v^{\prime} G \sum_{k = x1}^{x2} \left[ ( 1 - 1/4^{4} )^{nk} ( 1 - 1/4^{6} )^{(1 + m)k} \right]$

wherein:

-   k is fragment length (and x1 and x2 are the lower and upper limits),
-   G is the size of the genome,
-   n is the number of extra 4 nt cutters,
-   m is the number of extra 6 nt cutters,

is used to select an appropriate combination of 4 nt and 6 nt cutters.
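One way to use the N′ formula for selection is a simple search over small values of n and m, sketched below in Python. The search bounds, the function names, the default v′ = 1 and the illustrative genome size of roughly 3 × 10⁹ bp are assumptions made for the example, not values prescribed by the text.

```python
from itertools import product


def expected_partition_size(genome_size, n, m, v=1.0, x1=100, x2=2000):
    """N' = 4^-12 * v' * G * sum_k (1 - 1/4^4)^(n k) * (1 - 1/4^6)^((1+m) k)."""
    q4 = 1.0 - 1.0 / 4 ** 4  # per-base survival for one 4 nt cutter
    q6 = 1.0 - 1.0 / 4 ** 6  # per-base survival for one 6 nt cutter
    total = sum(q4 ** (n * k) * q6 ** ((1 + m) * k) for k in range(x1, x2 + 1))
    return (4.0 ** -12) * v * genome_size * total


def choose_cutters(genome_size, target, max_n=5, max_m=2):
    """Return the (n, m) pair whose expected size is closest to the target."""
    candidates = product(range(max_n + 1), range(max_m + 1))
    return min(
        candidates,
        key=lambda nm: abs(expected_partition_size(genome_size, *nm) - target),
    )


# Illustrative only: a genome of about 3e9 bp and a target of 1000 fragments.
n, m = choose_cutters(3.0e9, 1000)
print(n, m, round(expected_partition_size(3.0e9, n, m)))
```

An exhaustive search is adequate here because only a handful of enzymes are ever combined; the result would still be checked by a pilot test where required.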

This can be verified as described above in steps (vii)-(xiii) if required.

A corresponding approach may be used with cDNA from tissues or species in which the complexity is known or can be estimated, either directly or by comparison with other species.

Samples with Known Sequence

One or more reference sequences corresponding to the sample nucleic acid may be known. It will be understood that the sample nucleic acid sequence (inasmuch as it derives from a different source from the reference) is likely to include sequence variation with respect to any reference and indeed this variation between corresponding sequences underlies certain embodiments of the present invention. Nevertheless, since such variations are by definition rare, the reference sequence can be used to calculate restriction site frequency for restriction enzymes which it may be desired to use in the methods described herein.

When the sequence is known, the restriction site frequency of each enzyme can be provided, and the formula:

$N_{x1 \sim x2} = G P_{1}^{2} \sum_{k = x1}^{x2} \prod_{i} ( 1 - P_{i} )^{k}$

can be used to select the enzymes to produce a desired partition size.
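Where a reference sequence is available, the per-base site frequency of an enzyme can be estimated by counting occurrences of its recognition sequence. The Python sketch below assumes a plain uppercase or lowercase DNA string and palindromic recognition sites (so that scanning one strand is sufficient); the function name and the toy sequence are hypothetical.

```python
# Recognition sequences of three enzymes named in the text; all are
# palindromic, so only one strand needs to be scanned.
RECOGNITION_SITES = {
    "PstI": "CTGCAG",
    "HpaII": "CCGG",
    "AluI": "AGCT",
}


def site_frequency(sequence, site):
    """Occurrences of `site` in `sequence` (overlaps counted) per base."""
    if not sequence:
        raise ValueError("empty reference sequence")
    sequence = sequence.upper()
    site = site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count / len(sequence)


# Toy reference sequence, for illustration only.
reference = "AACCGGTTCTGCAGAGCTCCGGAA"
for enzyme, site in RECOGNITION_SITES.items():
    print(enzyme, site_frequency(reference, site))
```

The resulting frequencies can stand in for P_(i) in the formula, in place of the values expected for randomly distributed sites.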

Where a reference sequence is known, a set of restriction enzymes can be chosen based on the restriction map of the desired genes and other sequences so as to select them in particular, while still having an appropriately sized partition.

Some particular practical aspects of the invention will now be discussed in more detail:

Purification

In preferred embodiments the fragments are purified at step (ii).

As described in the Examples hereinafter, fragments may be purified in a conventional manner. In examples herein, the restriction reaction was passed through a column containing resins (QIAQuick PCR purification kit, QiaGen), which can effectively adsorb DNA molecules larger than 100 bp. After washing with 70% ethanol, the DNA fragments were eluted into 30-50 μl water. An alternative second method used the BioRad Clean-A-Gene kit. The third method was to purify the fragments by running a 1% agarose gel and recovering the DNA using a Promega gel recovery kit. For the third method, extra DNA should be used, for example, 10 micrograms for rice and pearl millet, 20 micrograms for human and wheat.

Preferred purification techniques will be such as to remove fragments of less than 100 bases.

Enrichment of Sample

Where a corresponding reference sequence is known, an enrichment strategy may be adopted, so that a particular region or gene may be treated. For example, when a particular set of fragments is required to be included, restriction enzymes may be chosen through a restriction map of the reference sequence(s). Moreover, if a particular set of genes needs to be studied, a set of oligos (16-60 bases, preferably 20-50 bases) could be designed from the reference sequence to enrich the genes e.g. via a hybridization method using magnetic beads with biotin-labelled oligonucleotides attached to them (see e.g. Edwards K J, Barker J H A, Daly A, Jones C, Karp A (1996) Microsatellite libraries enriched for several microsatellite sequences in plants. BioTechniques 20:758-760). This technique may be particularly useful when dealing with repetitive DNA.

Once the sample is enriched, it may be preferred to use pilot tests to confirm the size of the total DNA pool.

Enhancement Linkers

In preferred embodiments, enhancement linkers are added prior to or during step (iv) such that only the desired sub-set of layers is included in said library. The linkers prevent fragments with compatible restriction ends combining to form artifacts.

Such linkers (which may be provided as a pair of oligonucleotides) comprise:

-   (i) a core sequence, which is selected such that it does not contain a restriction site and does not have a high probability of hybridizing to target sequence,
-   (ii) a portion that matches the appropriate restricted end,
-   (iii) additional sequence to prevent the linkers annealing e.g. an overhang.

The enhancement linkers are not used for the cloning site restriction enzyme(s).

Preferred linkers are any of those given in Table 1.

Cloning and Ligation

The terms “cloning” and “ligation” and so on are used herein because they will be well understood by those skilled in the art, and can be performed by standard techniques. Those skilled in the art are well able to clone selected fragments into libraries; see, for example, Molecular Cloning: a Laboratory Manual: 2nd edition, Sambrook et al, 1989, Cold Spring Harbor Laboratory Press or Current Protocols in Molecular Biology, Second Edition, Ausubel et al. eds., John Wiley & Sons, 1992 (or later editions of these works) both of which are specifically incorporated herein by reference. Generally speaking a typical protocol can be achieved by exposing a vector restricted with the appropriate enzymes to the selected layers so as to ligate or otherwise incorporate the heterologous nucleic acid fragments into the vector at the appropriate cloning site; exposing the ligation product (recombinant vector) to host cells under conditions whereby the vector is taken up by the cells so as to generate a population of host cells containing the vector; and exposing the population of cells to a propagation medium comprising a selection agent whereby transformed host cells which contain vector incorporating the nucleic acid insert are selectively grown or propagated in the medium.

Where desired, one or more pairs of “adaptor” oligonucleotides may be used to bridge the cloning ends of the DNA fragments of interest (i.e. from the layer(s) in the desired sub-set) and the cloning site of the vector(s). The adaptor sequences have appropriate restriction site sequences (fragment and vector) at each end and a core sequence in the middle. An example core sequence is 5′-CGTAGACGATGCGTGAGAC-3′.

In such cases, PCR amplification may optionally be used to enrich the fragments of interest and increase the amount of DNA by using the adaptor sequence as a PCR primer. This may be advantageous where the quantity of fragments is relatively low.

Thus, prior to step (iv), the method may optionally include the step of ligating adaptor oligonucleotides to all or part (e.g. generally one or both layers, if two layers are selected) of the selected sub-set of fragments in order to facilitate their ligation into vectors adapted to receive them.

The adaptor sequences may optionally incorporate extra restriction sites.

Use for Discovery of Sequence Variation

As described in more detail below, the sample may comprise corresponding nucleic acid from several (e.g. two or more) different sources. This permits equivalent partitions to be compared e.g. for the discovery of sequence variation.

The methods described herein may be used to identify any type of marker e.g. microsatellites, minisatellites etc. Preferably the markers are SNPs.

The size of the partition sequences will be chosen to be appropriate to the number and nature of markers which it is desired to look for. Thus, for example, if ‘S’ different SNPs are required, it may be appropriate to ensure that there are at least that many different unique sequences in the partition (more preferably twice that many) representing a total length of S×1000 bases.

Markers can be investigated which are appropriate to the samples. For example, the nucleic acid-containing sample can be pooled from individuals who share a particular trait (e.g. an undesirable trait, such as a particular disorder, or a desirable trait, such as resistance to a particular disorder). Sequences can be taken from different species, varieties or populations so as to provide markers for plant-breeding, or phylogenetic studies etc. Preferred target genomes (or cDNA sources) include Human, Arabidopsis, wheat, rice, millet and soybean genomes.

Thus the invention provides a method for identifying a limited population of markers in a sample nucleic acid, which method comprises:

-   (a) providing sample nucleic acid from at least 2 different sources,
-   (b) providing a representative partition of the sample nucleic acid in accordance with the methods described herein,
-   (c) identifying differences within corresponding sequences from said different sources contained within the library.

The nucleic acid from different sources may be pooled. However it may also be analysed on separate occasions since the methods of the invention produce a partition of fixed size and fixed content in a reproducible manner.

Generally the corresponding sequences from the different sources within the partition are sequenced to identify the differences. Such sequence data is obtained by sequencing the library e.g. to 3-5 times coverage. If desired the actual size of the partition can be calculated as described herein.

The term “corresponding to” in terms of sequence comparisons herein (whether with a known reference, or between different source nucleic acids in a sample) refers to sequences derived from equivalent loci or genes from two different genomes (e.g. the sequences may be orthologues, homologues, alleles etc.) but which may therefore include differences between them (e.g. by way of mutation, polymorphism, or other sequence variation which gives rise to nucleic acid “markers”).

Corresponding sequences will generally be at least 80% identical, most preferably at least about 90%, 95%, 96%, 97%, 98% or 99% identical. Identity is established by comparison of the full length of the sequences (or the shorter of the sequences). Thus alignment of different sequencing results, and assessment of the degree of identity between them, can be used to confirm that sequences are indeed corresponding ones, and hence that sequence differences between them represent potential markers. For markers which are candidate single nucleotide polymorphisms, the frequency should preferably not exceed 1% of the total number of bases in the shorter of the two sequences; sequences which meet these criteria may be selected as corresponding. Whether sequences are indeed corresponding sequences showing intergenomic or inter-gene variation, rather than e.g. multiple copies in a single genome or individual, can be verified if desired by conventional methods familiar to those skilled in the art of SNP identification. For example, intergenome or inter-gene-copy variation is generally larger than the allelic variation so that a phylogenetic tree of the sequences in an alignment based on sequence similarity may distinguish the two types of variation. If required, SNP candidates can be validated by genotyping and genetic mapping: if the marker segregates and can be mapped to a chromosomal location, it would normally be recognized as true allelic variation.
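The identity and SNP-frequency criteria can be illustrated with a short Python sketch. It assumes the two reads are already aligned without gaps from their first base, which is a simplification (a real comparison would use a proper alignment); the function name, the returned keys and the thresholds passed as defaults are illustrative, with 80% and 1% taken from the preferred values above.

```python
def compare_reads(seq_a, seq_b, min_identity=0.80, max_snp_fraction=0.01):
    """Compare two pre-aligned, gap-free reads over the shorter length.

    Returns the identity, whether the pair meets the identity threshold,
    and the mismatch positions if they do not exceed the SNP fraction.
    """
    length = min(len(seq_a), len(seq_b))
    if length == 0:
        raise ValueError("empty sequence")
    mismatches = [
        i for i in range(length) if seq_a[i].upper() != seq_b[i].upper()
    ]
    identity = 1.0 - len(mismatches) / length
    within_snp_limit = len(mismatches) <= max_snp_fraction * length
    return {
        "identity": identity,
        "corresponding": identity >= min_identity,
        "snp_candidates": mismatches if within_snp_limit else [],
    }


# Two hypothetical 200-base reads differing at a single position.
read_a = "ACGT" * 50
read_b = read_a[:50] + "A" + read_a[51:]
print(compare_reads(read_a, read_b))
```

Candidates flagged this way would still need the validation described above, for example by genotyping and genetic mapping.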

Use in Genotyping

Many uses of SNPs require: (i) the SNP's map position in the human genome, and (ii) a genotyping assay for scoring the locus in association studies.

Methods for assessment of polymorphisms are reviewed by Schafer and Hawkins (Nature Biotechnology (1998) 16, 33-39, and references referred to therein) and include: allele specific oligonucleotide probing, amplification using PCR, denaturing gradient gel electrophoresis, RNase cleavage, chemical cleavage of mismatch, T4 endonuclease VII cleavage, multiphoton detection, cleavase fragment length polymorphism, E. coli mismatch repair enzymes, denaturing high performance liquid chromatography, (MALDI-TOF) mass spectrometry, and analysing the melting characteristics for double stranded DNA fragments as described by Akey et al (2001) Biotechniques 30; 358-367.

The assessment of polymorphisms may be carried out on a DNA microchip. One example of such a microchip system may involve the synthesis of microarrays of oligonucleotides on a glass support. Fluorescently-labelled PCR products may then be hybridised to the oligonucleotide array and sequence specific hybridisation may be detected by scanning confocal microscopy and analysed automatically (see Marshall & Hodgson (1998) Nature Biotechnology 16: 27-31, for a review).

Thus the invention also provides for a method for making a genotyping microchip for use in assaying a limited population of polymorphisms within a sample (see, e.g., U.S. Pat. Nos. 5,861,242 and 5,837,832).

As with other reduced representation approaches, the present invention can facilitate efficient genotyping. Once a set of polymorphisms is isolated, probes or primers for detecting those polymorphisms can be incorporated into such a chip. When it is desirable to assay an individual for the polymorphisms in the set, nucleic acid is isolated from that individual, and it can be partitioned with the same methods that were used to isolate the original set of polymorphisms.

However, this invention is more flexible than the other reduced representation approaches because it can greatly and flexibly reduce the size of a partition e.g. to as small as one containing 500 unique fragments.

For example, if one wishes to genotype a new sample for 10,000, 1000 or 100 SNPs isolated from a specific partition, one could restriction-digest the sample; isolate an appropriate partition; and amplify by PCR using primers complementary to a generic linker. The resulting amplification products could be hybridized to an appropriate ‘genotyping array’. Such methods allow the user to concentrate study on only a limited portion of the entire spectrum of the available polymorphisms. By examining only a limited portion of the genome, this method has the added benefit of reducing cross-reactivity between unrelated genetic sites.

Use for Investigation of Methylation Sensitivity

For methylation sensitivity studies, methylation sensitive and non-sensitive restriction enzymes may be used separately so that the methylation distribution patterns can be revealed by comparing the two.

Computer-Implemented Embodiments

In a further aspect of the present invention, some or all of the steps of the methods described above may be performed by a digital computer, in particular steps in designing appropriate genome partitions based on reference sequence restriction maps and/or equations as described above. Although this could be done using commercially available sequence analysis software and sequence databases, in preferred embodiments a bespoke system directly provides the choice of enzymes to use.

Thus the invention provides an automated computer system, comprising a combination of hardware and software, that can rapidly determine optimised partitions based on a reference sequence, a desired size, and optionally a desired region within the sequence.

Preferably, these aspects of the invention are implemented in computer programs executing on a programmable computer comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Data input through one or more input devices for temporary or permanent storage in the data storage system includes sequences. Program code is applied to the input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion.

The program code will include analysis of some or all of the functions described above, and will include the ability to input a reference sequence, and preferences regarding partition size and optionally preferred regions to include in the partition. The program code will also be able to reference (e.g. from a look-up table) restriction site target sequences for different 4 and 6 nt cutters.

The automated system can be implemented through a variety of combinations of computer hardware and software. In one implementation, the computer hardware is a high-speed multi-processor computer running a well-known operating system, such as UNIX. In other embodiments personal computers using single or multiple microprocessors might also function within the parameters of the present invention.

Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

The invention will now be further described with reference to the following non-limiting Figures and Examples. Other embodiments of the invention will occur to those skilled in the art in the light of these.

EXAMPLE 1 Methods for Determining Size of Layers and Partitions

Relationship between Enzymes and Layers

When DNA is digested with more than one restriction enzyme, the DNA fragments can be classified into groups based on the restriction ends produced specifically by the restriction enzymes.

When N different enzymes are used, the maximum number of groups of DNA fragments generated, which are called “layers” herein, is: $L = N + (N^{2} - N)/2$, that is, N layers whose fragments have two identical ends plus $(N^{2} - N)/2$ layers whose fragments have two different ends.

Each layer of DNA fragments can be specifically cloned into a cloning vector at the corresponding restriction site. The specificity is determined by the cloning site, which only matches the restriction fragment ends of the chosen layers.
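The classification of fragments into layers can be illustrated with a minimal in silico digestion. This is a sketch under stated assumptions: the enzyme table, the function names and the toy sequence are hypothetical, digestion is taken to be complete, and only fragments with an enzyme-generated end on both sides are counted.

```python
from collections import Counter

# Hypothetical table: enzyme -> (recognition site, cut offset on the top strand).
ENZYMES = {
    "HpaII": ("CCGG", 1),    # C^CGG
    "AluI": ("AGCT", 2),     # AG^CT
    "PstI": ("CTGCAG", 5),   # CTGCA^G
}


def digest(sequence, enzymes):
    """Return (left enzyme, right enzyme, fragment) for a complete digest."""
    cuts = []
    for name in enzymes:
        site, offset = ENZYMES[name]
        start = sequence.find(site)
        while start != -1:
            cuts.append((start + offset, name))
            start = sequence.find(site, start + 1)
    cuts.sort()
    fragments = []
    # The two terminal pieces of the linear molecule are skipped because
    # only one of their ends is produced by a restriction enzyme.
    for (pos_a, enz_a), (pos_b, enz_b) in zip(cuts, cuts[1:]):
        if pos_b > pos_a:
            fragments.append((enz_a, enz_b, sequence[pos_a:pos_b]))
    return fragments


def layers(fragments):
    """Group fragments by the unordered pair of enzymes that made their ends."""
    return Counter(tuple(sorted((a, b))) for a, b, _ in fragments)


reference = "AACCGGTTAGCTCCGGCTGCAGTTAGCTAA"  # made-up toy sequence
found = layers(digest(reference, ["HpaII", "AluI", "PstI"]))
for layer, count in sorted(found.items()):
    print(layer, count)
```

With three enzymes up to six layers are possible; the toy sequence is too short to populate all of them.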

Combinations of Layers

In principle, any combination of the layers can be cloned into a library. The sub-set or combination of layers cloned is termed a “partition” herein. The number of possible partitions will be: $P = C_{L}^{1} + C_{L}^{2} + \ldots + C_{L}^{L-1}$.

For example, when five different enzymes are used, there should be up to 15 layers and 32766 partitions. In practice, it is preferred to use only a partition containing one or two layers for library construction. Thus, five enzymes could provide 15 or 225 partitions. Given that more than a hundred restriction enzymes are available on the market, the number of possible partitions of a genome is huge.
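The layer and partition counts for five enzymes can be checked with a few lines of code. This is only an arithmetic sketch of the two formulae above; the function names are assumptions introduced here.

```python
from math import comb


def max_layers(n_enzymes: int) -> int:
    """L = N + (N^2 - N)/2: N same-end layers plus all mixed-end pairs."""
    return n_enzymes + (n_enzymes**2 - n_enzymes) // 2


def partitions(n_layers: int) -> int:
    """P = C(L,1) + C(L,2) + ... + C(L,L-1)."""
    return sum(comb(n_layers, r) for r in range(1, n_layers))


layer_count = max_layers(5)
partition_count = partitions(layer_count)
print(layer_count, partition_count)  # 15 32766
```

The sum equals $2^{L} - 2$, since it omits only the empty sub-set and the sub-set containing every layer.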

Estimating Number and Size of Fragments per Layer

The size of a layer depends on the number and the types of enzymes used.

For a given cloning site generated by a 6 nt cutter,

$\text{Total number of fragments} = \text{total number of restriction sites} = \frac{vG}{4^{6}}.$

-   (G stands for genome size in base pairs).
-   (v is the frequency deviation for each particular enzyme in a particular genome, and may be assumed to be 1 unless known or established to be otherwise).

The probability of a restriction fragment having length ≥ k is $(1 - 1/4^{6})^{k}$.

The probability of obtaining a fragment of length k is $(1 - 1/4^{6})^{k} - (1 - 1/4^{6})^{k+1}$.

The number of fragments with length between x1 and x2 is $N = 4^{-6} v G [(1 - 1/4^{6})^{x1} - (1 - 1/4^{6})^{x2}]$.
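As a rough numerical illustration of the single-enzyme case, the closed form above can be evaluated directly. The function name and default arguments are assumptions of this sketch; the genome size of 400 million base pairs is the rice figure used in Example 3, and v is taken as 1.

```python
def fragments_in_range(G, x1, x2, site_len=6, v=1.0):
    """N = 4^-n * v * G * [(1 - 4^-n)^x1 - (1 - 4^-n)^x2] for one n nt cutter."""
    p = 0.25 ** site_len   # chance of a site starting at any given base
    q = 1.0 - p            # chance of no site at a given base
    return v * G * p * (q**x1 - q**x2)


rice_genome = 400e6
total_sites = rice_genome * 0.25**6
in_window = fragments_in_range(rice_genome, 100, 2000)
print(round(total_sites), round(in_window))
```

Under these assumptions roughly a third of all fragments produced by a single 6 nt cutter fall in the 100 to 2000 bp window.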

With an extra 4 nt cutter, the number of fragments per layer will be reduced because a given fragment could be cut internally, to generate fragments with different combinations of restriction ends, which are hence no longer within the original layer. Thus the fragments per layer will be reduced to:

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{4})^{k} (1 - 1/4^{6})^{k} \right].$

With two extra 4 nt cutters,

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{4})^{2k} (1 - 1/4^{6})^{k} \right].$

With three extra 4 nt cutters,

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{4})^{3k} (1 - 1/4^{6})^{k} \right].$

With n extra 4 nt cutters,

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{4})^{nk} (1 - 1/4^{6})^{k} \right].$

With an extra 6 nt cutter, the number of fragments will be reduced to

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{6})^{2k} \right].$

With two extra 6 nt cutters,

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{6})^{3k} \right].$

If one 6 nt cutter is used for the cloning site, and ‘n’ extra 4 nt cutters and ‘m’ extra 6 nt cutters are used, the number of fragments will be

$N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{4})^{nk} (1 - 1/4^{6})^{(1+m)k} \right].$

Herein v′ is a combined frequency deviation, so this formula is preferably used only when v′ is assumed to be one or when a pilot test is used to verify the partition design.

In general, the number of fragments with length between x1 and x2 (in base pairs) is

$N_{x1 \sim x2} = G P_{1}^{2} \sum_{k=x1}^{x2} \prod_{i} (1 - P_{i})^{k},$

in which $P_{i}$ is the probability of having a restriction site at any base pair for the ‘i’th enzyme used, $P_{1}$ represents that for the enzyme of the cloning site, and the product is taken over all of the enzymes used, including the enzyme of the cloning site.
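The general formula lends itself to direct evaluation. The following is a minimal sketch, assuming that each $P_{i}$ equals $4^{-n}$ for an n nt cutter (i.e. all frequency deviations are 1); the function name and argument layout are hypothetical.

```python
def expected_fragments(G, cloning_site_len, extra_site_lens, x1=100, x2=2000):
    """N = G * P1^2 * sum_{k=x1}^{x2} prod_i (1 - P_i)^k, with P_i = 4^-n_i."""
    p1 = 0.25 ** cloning_site_len
    # Per-base chance that none of the enzymes (cloning enzyme included) cuts.
    no_cut = 1.0 - p1
    for site_len in extra_site_lens:
        no_cut *= 1.0 - 0.25 ** site_len
    return G * p1**2 * sum(no_cut**k for k in range(x1, x2 + 1))


# A 6 nt cloning-site enzyme with three extra 4 nt cutters, 400 Mbp genome.
estimate = expected_fragments(400e6, 6, [4, 4, 4])
print(round(estimate))
```

With these inputs the sketch gives a partition on the order of 600 fragments. Each extra cutter shrinks the partition, because every additional factor $(1 - P_{i})^{k}$ lowers the chance that a fragment of length k survives uncut.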

It should be noted that when a partition is based on fragments having two different restriction ends, the number of matching fragments remains the same. Although the number of total fragments is doubled with two enzymes, the chance of having two different ends is 50%. Therefore, the size of a partition with one cloning end is the same as that with a combination of two different cloning ends if the other restriction enzymes (fragment cutters, the enzymes which do not match the cloning site) are the same. Thus for the purposes of calculation, the two restriction enzymes for the cloning site may be counted as one enzyme, with $P_{1}$ taken as the mean of that of the two enzymes.

In preferred embodiments, most cloned fragments will fall between 100 and 2000 base pairs (and hence x1 and x2 may be assumed as 100 bp and 2000 bp). This is because smaller fragments, which are not informative, may be removed by purification techniques. Additionally, the selected restriction endonuclease(s) will generally cleave the sample nucleic acid molecule at least approximately every 2000 bases. Thus larger fragments will be comparatively rare.

Testing the Number of Unique Fragments—“Pilot Testing”

Since the frequency of a given restriction site varies greatly from enzyme to enzyme and from genome to genome, the frequency of the enzymes and the actual size of designed partitions need to be tested unless they are known from a pre-existing sequence.

To evaluate the number of unique fragments in a partition, after the library of the partition has been constructed in accordance with the above, randomly pick and sequence 500 well-separated colonies. Assemble the sequences so that identical sequences are piled up in alignments. Each such alignment may be termed a “contig” or “clique”. The number of unique fragments in the partition should be $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1) \pm s$, in which n is the total number of sequences and $n_{i}$ is the number of sequences in the ith contig. When the number of sequences is big enough, the standard error s can be neglected (see Appendix I, where the derivation is given).
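The pilot-test estimate can be computed from the list of contig sizes alone. The sketch below ignores the standard error s; the function name and the toy contig sizes are assumptions made for illustration and are not data from the Examples.

```python
def estimate_unique_fragments(contig_sizes):
    """F = n(n-1) / sum_i n_i(n_i - 1), where n_i is the size of contig i."""
    sizes = list(contig_sizes)
    n = sum(sizes)
    same_pairs = sum(size * (size - 1) for size in sizes)
    if same_pairs == 0:
        # No sequence was seen twice, so the sample says nothing about F.
        raise ValueError("no repeated sequences; sequence more colonies")
    return n * (n - 1) / same_pairs


# Toy input: six sequences seen once and two sequences seen twice each.
toy_contigs = [1, 1, 1, 1, 1, 1, 2, 2]
print(estimate_unique_fragments(toy_contigs))  # 22.5
```

Singleton contigs contribute nothing to the denominator, so the estimate rests entirely on the sequences that were sampled more than once.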

EXAMPLE 2 Use of a Partition to Find DNA Sequence Variation

Partition Strategy

Clearly, the larger the partition, the more sequencing reactions are needed to obtain pairwise sequence comparisons. It is therefore preferred to keep the size of the partition to the minimum likely to encompass the number of sequence variations which it is desired to identify.

For example, if five hundred SNPs are required for a population or a panel of varieties, the partition should provide more than five hundred unique sequences (ideally about 1000). Random sequencing should preferably cover the library 3-5 times; more than 10 times should not be necessary.
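The preference for 3-5 fold coverage can be rationalised with a simple sampling model. The sketch below assumes, purely for illustration, that the number of reads per unique fragment follows a Poisson distribution with mean equal to the coverage; this model and the function name are not part of the specification.

```python
import math


def fraction_sampled(coverage, min_reads=1):
    """Poisson chance that a given fragment is read at least `min_reads` times."""
    below = sum(
        math.exp(-coverage) * coverage**j / math.factorial(j)
        for j in range(min_reads)
    )
    return 1.0 - below


for coverage in (3, 5, 10):
    once = fraction_sampled(coverage)        # fragment seen at all
    twice = fraction_sampled(coverage, 2)    # enough reads for a pairwise comparison
    print(coverage, round(once, 3), round(twice, 3))
```

Under this model most fragments are already read at least twice at 3-5 fold coverage, and going beyond 10-fold adds very little.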

The number and types of restriction enzymes should be decided based on the formulae described above. When the genome sequence is available, the restriction site frequency can be checked, and a particular design to cover certain genomic regions or genes can be performed using a known or bespoke program. A sequence enrichment strategy can also be considered at that stage.

For a new species and a particular set of enzymes, a pilot test is carried out to confirm that the expected size of the partition is valid in respect of that genome. For cDNA, a pilot test may be required in each case to hone the partitioning.

Sample Preparation

This can be done in conventional manner. For rice DNA, for example, at least two micrograms is preferred. For the human genome, more than five micrograms of DNA is recommended for normal genome partitioning without gel-based purification.

Restriction Digestion

Restriction digestion can be performed in one cocktail. However, if the enzymes are optimal in different conditions, two or even three stages of reaction should be carried out.

Partial digestion can be used as a special way to enlarge a partition. Normally, partial digestion is only performed with one enzyme, namely the one which generates the cloning ends.

Use of Enhancing Linkers

For ligation, enhancing linkers can be designed to avoid chimeric sequences and the restoration of undesired restriction sites during ligation. In the Examples herein, each linker consists of two oligos. The core sequences were 5′-TTGGCGTTTAC-3′ and 3′-CCGCAAATG-5′.

In order to define the core sequence, a set of randomly generated short sequences was BLAST searched against all sequences from different species in the EMBL database. 5′-GGCGTTTAC-3′ was selected on the basis that it had the fewest hits and did not contain a restriction site.

One end of the linker has an overhang ‘TT’ so that no linkage can be made at this end. The other end has a sticky end with added nucleotides, which matches the restriction site; this can be linked to the genomic DNA fragments with undesired restriction ends. Because of the competition from these linkers, DNA fragments with the same restriction site as the linkers will not link to each other to create “false” fragments within given layers.

Thus for each restriction enzyme used (except that for the cloning site) a corresponding enhancing linker should be added to the ligation reaction. In preferred embodiments the final concentration of each oligo should be 0.1 μM. This is conveniently achieved using a stock solution of each oligo (1 mM), which can be stored for use e.g. at −20° C. Before ligation, a ‘cocktail’ of these oligos is made containing each necessary oligo at a concentration of 10 μM, and 1 μl of the cocktail is added to the 100 μl ligation reaction.
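The dilution scheme above can be verified with simple arithmetic. In this sketch the helper name and the 100 μl cocktail volume are assumptions chosen for illustration; the concentrations and the 1 μl in 100 μl step are those stated above.

```python
def diluted_concentration(conc_uM, added_ul, total_ul):
    """Concentration after adding `added_ul` at `conc_uM` into `total_ul` total."""
    return conc_uM * added_ul / total_ul


stock_uM = 1000.0      # 1 mM stock of each oligo
cocktail_uM = 10.0     # each oligo in the cocktail

# Volume of stock needed per oligo for a hypothetical 100 ul cocktail.
stock_ul_per_100ul = cocktail_uM / stock_uM * 100.0

# 1 ul of cocktail added to a 100 ul ligation reaction.
final_uM = diluted_concentration(cocktail_uM, 1.0, 100.0)
print(stock_ul_per_100ul, final_uM)
```

Both steps are 100-fold dilutions, which takes each oligo from 1 mM to the preferred final concentration of 0.1 μM.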

Preferred enhancing linkers are listed in Table 1 hereinafter. The restriction endonucleases in the list are recommended for genome partitioning.

Cloning

This can be done in conventional manner. Zero Background vector from Invitrogen was used. Ligation, transformation, colony picking, miniprep and sequencing were performed using routine DNA library construction protocols.

Compatibility with two automated systems (Qiagen Robots 3000 and 8000 with QIAprep 96 Turbo BioRobot Kit) was demonstrated, showing the utility of the invention in high-throughput screening.

EXAMPLE 3 SNP Discovery in Rice

Rice is a model plant for cereals. DNA sequences are widely available for the rice subspecies Indica and Japonica. The rice genome is about 400 million base pairs and has been shotgun sequenced independently by several groups, while at least one other group (Japanese National Rice Genome Project) is using a BAC strategy. Currently, sequences from Huada⁴ and RGP⁵ are publicly available for Indica and Japonica respectively.

Genomic DNA was isolated from 20 rice varieties and equally pooled into one sample (Table 2 below).

Ten μg of the pooled DNA was digested with 0.5 μl each of HpaII, AluI, DraI and PstI in a cocktail with GIB buffer 8. The total volume of the reaction was 100 μl and it was incubated at 37° C. for 12 hours (overnight).

The digested DNA was purified using the QIAquick PCR purification kit (Qiagen). The purified DNA was eluted in 20 μl water and subsequently 5 μl of the purified DNA fragments were used in a 10 μl ligation reaction. Six oligos (as three enhancing linkers for HpaII, AluI and DraI) were added to the reaction. They were 5′-TTGGCGTTTAC-3′, 5′-CGGTAAACGCC-3′, 5′-TTGGCGTTTAC-3′, 5′-GTAAACGCC-3′, 5′-TTGGCGTTTAC-3′, 5′-AATTGTAAACGCC-3′ (see Table 1). The final concentration of each oligo was 0.1 μM. One μl of ligase was used and 0.2 μg pZero vector (Invitrogen) digested with PstI was added. The reaction was carried out at 15° C. for 30 minutes and then kept at −20° C. for subsequent transformation.

One-shot competent cells (Invitrogen) were used for transformation of E. coli. Kanamycin was used as the selection antibiotic. After overnight culture on LB medium agar plates, approximately 600 colonies were selected. The colonies were cultured in 1.5 ml LB medium and the plasmid DNA was isolated using a Qiagen miniprep kit. Thirty of the plasmid DNA samples were run on an agarose gel to see the size of the inserts. Among the thirty samples, the insert size ranged from 200 to 3000 bp, with an average of 800 bp. The DNA was sequenced using the fluorescent-capillary method on an ABI 3700 (sequencing service was provided by the John Innes Centre).

The sequences were processed with PreGap4 to cut away poor-quality sequence and vector sequence. The sequences of good quality (the PreGap4 default threshold was used for quality control) were assembled into contigs using Gap4.

About 400 pairwise comparisons were found (Table 3), from which 278 SNP candidates were identified.

TABLE 3 Number of sequences and SNP candidates

No. of sequences    No. of     No. of sequences       No. of SNP
in each contig      contigs    in each contig type    candidates
1                   212        212                    -
2                   121        242                    222
3                   8          24                     46
4                   2          8                      6
6                   1          6                      0
8                   1          8                      4
Total               345        500                    278

Using the formula $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1) \pm s$, the size of the partition was estimated as containing 624 unique colonies (the standard error was ignored as being insignificant) (Table 3). In this calculation,

$F = 500 \times (500-1) / [212 \times 1 \times (1-1) + 121 \times 2 \times (2-1) + 8 \times 3 \times (3-1) + 2 \times 4 \times (4-1) + 1 \times 6 \times (6-1) + 1 \times 8 \times (8-1)] \approx 624.$

The average insert size of the colonies was 800 bp. Since the rice genome is 400 million bp and the size of the library was (624×800) bp, the genome partition was about 1/800 of the whole genome. In other words, this genome partitioning design reduced the complexity of the library by about 800 times.
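The arithmetic of this Example can be reproduced from the contig counts in Table 3. This is only a checking sketch; the variable names are introduced here for illustration.

```python
# Table 3: contig size -> number of contigs of that size.
contig_counts = {1: 212, 2: 121, 3: 8, 4: 2, 6: 1, 8: 1}

n = sum(size * count for size, count in contig_counts.items())
same_pairs = sum(count * size * (size - 1) for size, count in contig_counts.items())
F = n * (n - 1) / same_pairs

average_insert_bp = 800
genome_bp = 400e6
library_bp = round(F) * average_insert_bp
fold_reduction = genome_bp / library_bp

print(n, same_pairs, F, library_bp, round(fold_reduction))
```

The 500 sequences give a denominator of 400 and hence F of 623.75, which rounds to the 624 quoted above; the library of about 0.5 million bp corresponds to a reduction in complexity of roughly 800-fold.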

EXAMPLE 4 SNP Discovery in Pearl Millet

Pearl millet (Table 4) was tested using the procedure set out in Example 3. The total number of sequences was 607, from about 800 colonies. The result showed that a partition containing about 2000 colonies was constructed.

Since the size of the pearl millet genome is not known accurately, the actual reduction in complexity of the genome was not determined, nor has the total number of SNPs been calculated.

TABLE 4 Pearl millet varieties pooled for genome partitioning experiment

1. Tift238D
2. IP10401
3. IP10402
4. IP8214
5. 81B
6. ICMP451
7. LGD-1
8. ICMP85410
9. Tift23DB
10. 843B
11. P7
12. PT732B
13. P1449
14. 841B
15. 863B
16. H77
17. PRLT2
18. ICMP501
19. Tift383
20. 700481-21-8

REFERENCES

1. J. Craig Venter, et al. 2001. Science 291:1304-1315.
2. P. Vos, et al. 1995. Nucleic Acids Res 23:4407-4414.
3. D. Altshuler, et al. 2000. Nature 407:513-516.
4. Hua Da rice sequence database: http://210.83.138.53/rice/tools.php

5. Japanese sequence database: http://rgp.dna.affrc.go.jp/

TABLE 1 Sequences of enhancing linkers

Enzyme        Oligo 1                    Oligo 2
Acc I         5′-TTGGCGTTTAC-3′          5′-ATGTAAACGCC-3′
                                         5′-CGGTAAACGCC-3′
Aci I         5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
Afl III       5′-TTGGCGTTTAC-3′          5′-CUYGGTAAACGCC-3′
Alu I         5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Apo I         5′-TTGGCGTTTAC-3′          5′-AATTGTAAACGCC-3′
Ban I         5′-TTGGCGTTTAC-3′          5′-GYUCGTAAACGCC-3′
Ban II        5′-TTGGCGTTTACUGCY-3′      5′-GTAAACGCC-3′
Bfa I         5′-TTGGCGTTTAC-3′          5′-TAGTAAACGCC-3′
BsaA I        5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
BsaH I        5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
BsaJ I        5′-TTGGCGTTTAC-3′          5′-CNNGGTAAACGCC-3′
BsiE I        5′-TTGGCGTTTACUY-3′        5′-GTAAACGCC-3′
BssK I        5′-TTGGCGTTTAC-3′          5′-CCNGGGTAAACGCC-3′
BstN I        None is needed.
BstU I        5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Btg I         5′-TTGGCGTTTAC-3′          5′-CUYGGTAAACGCC-3′
Cac8 I        5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Dpn I         5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Dpn II        5′-TTGGCGTTTAC-3′          5′-GATCGTAAACGCC-3′
Dra I         5′-TTGGCGTTTAC-3′          5′-AATTGTAAACGCC-3′
Eae I         5′-TTGGCGTTTAC-3′          5′-GGCCGTAAACGCC-3′
Fnu4H I       None is needed.
Hae II        5′-TTGGCGTTTACGCGC-3′      5′-GTAAACGCC-3′
Hae III       5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Hha I         5′-TTGGCGTTTACCG-3′        5′-GTAAACGCC-3′
Hinc II       5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Hinf I        5′-TTGGCGTTTAC-3′          5′-ANTGTAAACGCC-3′
HinP1 I       5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
Hpa II        5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
Hpy188 I      None is needed.
HpyCH4 III    None is needed.
HpyCH4 IV     5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
HpyCH4 V      5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Mbo I         5′-TTGGCGTTTAC-3′          5′-GATCGTAAACGCC-3′
Mnl I         None is needed.
Mse I         5′-TTGGCGTTTAC-3′          5′-TAGTAAACGCC-3′
Msl I         None is needed.
Msp I         5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
Nla III       5′-TTGGCGTTTACCATG-3′      5′-GTAAACGCC-3′
Nla IV        5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Nsp I         5′-TTGGCGTTTACCATG-3′      5′-GTAAACGCC-3′
Rsa I         5′-TTGGCGTTTAC-3′          5′-GTAAACGCC-3′
Sau3A I       5′-TTGGCGTTTAC-3′          5′-GATCGTAAACGCC-3′
Sau96 I       5′-TTGGCGTTTAC-3′          5′-GNCGTAAACGCC-3′
ScrF I        None is needed.
Sfc I         5′-TTGGCGTTTAC-3′          5′-TUYAGTAAACGCC-3′
Sml I         5′-TTGGCGTTTAC-3′          5′-TYUAGTAAACGCC-3′
Taq I         5′-TTGGCGTTTAC-3′          5′-CGGTAAACGCC-3′
Tsp509 I      5′-TTGGCGTTTAC-3′          5′-AATTGTAAACGCC-3′
CviJ I        None is needed.
CviT I        None is needed.

TABLE 2 20 Rice Varieties

Series No.    RC No.    IRGC No.    Name
1             1         25833       AusJhari
2             8         25885       Lakhsnikajal
3             10        25898       Mimidim
4             17        27502       Walanga
5             18        27522       Ashmber
6             21        33118       Hnanwa
7             26        34737       Bawoi
8             27        38697       NPE837
9             28        62154       ASU
10            33        64780       Kalshori
11            36        64792       Narikel Jhupi
12            40        64887       Dagpa Bara
13            48        66513       Guru Muthessa
14            50        66529       Podi Niyanwee
15            58        66614       Puteh Kaca
16            81        67423       Aguyod
17            88        67720       Banikat
18            98        71496       Babalatik
19            178       78333       Khau Muong Pieng
20            181       78369       Nep Ngau

Appendix I Derivation of Formula, $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1) \pm s$.

Assume a pool which has F different (unique) sequences, each unique sequence being present in a very large and equal number of copies. Then the size of this pool, in terms of genome partitioning, is F. The chance of randomly selecting a pair of sequences that are the same is 1/F, because the pool is so large that taking one sequence out of the pool makes almost no difference to its size.

If P is the total number of pairwise combinations of identical sequences and P′ is the total number of all pairwise combinations, the chance of randomly selecting a pair of sequences that are the same is also P/P′. Thus, F=P′/P.

If n is the total number of sequences, then $P' = n(n-1)/2$.

If $n_{i}$ is the number of sequences of the ith unique sequence (or contig), with i running from 1 to F, then

$P = [n_{1}(n_{1}-1) + n_{2}(n_{2}-1) + \ldots + n_{F}(n_{F}-1)]/2 = \sum_{i=1}^{F} n_{i}(n_{i}-1)/2 = \sum_{i} n_{i}(n_{i}-1)/2.$

Therefore, $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1)$.

If the number of sequences is small, then, since we are sampling the pool, there will be a statistical error, which is denoted s. As a result, $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1) \pm s$.
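The behaviour of the estimator and of its sampling error can be explored by simulation. The sketch below draws reads uniformly at random from a pool of known size and applies the formula; the function names, the seeds and the chosen pool size of 624 are assumptions for illustration only.

```python
import random
from collections import Counter


def estimate_F(contig_sizes):
    """F = n(n-1) / sum_i n_i(n_i - 1); infinite if no sequence repeats."""
    sizes = list(contig_sizes)
    n = sum(sizes)
    same_pairs = sum(size * (size - 1) for size in sizes)
    if same_pairs == 0:
        return float("inf")
    return n * (n - 1) / same_pairs


def simulate(true_F, n_reads, seed):
    """Sample n_reads sequences uniformly from true_F unique sequences."""
    rng = random.Random(seed)
    reads = [rng.randrange(true_F) for _ in range(n_reads)]
    return estimate_F(Counter(reads).values())


estimates = [simulate(624, 500, seed) for seed in range(20)]
mean_estimate = sum(estimates) / len(estimates)
print(round(min(estimates)), round(mean_estimate), round(max(estimates)))
```

The spread between the smallest and largest estimate gives a feel for the size of s at 500 reads, and it narrows as the number of reads increases.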

1. A method for producing a nucleic acid library, which library contains a plurality of different nucleic acid fragments, the combination of said fragments being a representative partition of the entirety of a sample nucleic acid, the method comprising: (i) digesting the sample nucleic acid with a plurality of different restriction enzymes to generate a plurality of different layers of fragments, wherein each layer is a group of fragments having a unique combination of restriction ends, and wherein the combination of layers represents the entirety of the sample nucleic acid, (ii) optionally purifying said fragments, (iii) selecting a desired sub-set of layers according to the unique restriction ends of said layers, (iv) ligating said sub-set of layers into vectors adapted to receive it, (v) transforming host cells with the vectors, (vi) culturing said host cells to provide said library containing said partition of the sample nucleic acid.

2. A method as claimed in claim 1 wherein the sample is genomic DNA.

3. A method as claimed in claim 2 wherein the sample consists of an entire genome.

4. A method as claimed in claim 1 wherein the sample optionally comprises genomic DNA and the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected in order to generate a library size with a reduced complexity compared to the sample nucleic acid of at least 10, 100, or 1000-fold.

5. A method as claimed in claim 4 wherein between 3 and 6 restriction enzymes are used.

6. A method as claimed in claim 4 wherein the digestion by one restriction enzyme is partial, and the group of fragments in the selected layer have restriction ends created by said partial digestion.

7. A method as claimed in claim 1 wherein the selected sub-set of layers consists of one layer.

8. A method as claimed in claim 1 wherein the sub-set of layers consists of two layers.

9. A method as claimed in claim 1 wherein the fragments are purified at step (ii).

10. A method as claimed in claim 9 wherein the purification removes fragments of less than 100 bases.

11. A method as claimed in claim 9 wherein the size range of the fragments in the library is between 100 and 2000 bps.

12. A method as claimed in claim 1 wherein enhancement linkers are added prior to or during step (iv) to prevent undesired sub-sets of layers being included in said library, said enhancement linkers comprising: (i) a core sequence, (ii) a portion that matches the restricted-end of an undesired sub-set, and (iii) a sequence to inhibit the fragments in the undesired sub-set recombining.

13. A method as claimed in claim 12 wherein the enhancement linkers comprise any of those given in Table 1.

14. A method as claimed in claim 1 wherein adaptor oligonucleotides are used in step (iv) to facilitate the ligation of the desired sub-set of layers into vectors adapted to receive it.

15. A method as claimed in claim 1 wherein said sample is derived from an organism selected from the group consisting of Human, Arabidopsis, wheat, rice, millet, and soybean.

16. A method as claimed in claim 1 wherein libraries are prepared separately using methylation sensitive and non-sensitive restriction enzymes, whereby comparison of the libraries permits methylation distribution patterns in the sample to be revealed.

17. A method as claimed in claim 1 wherein the sequence of the sample nucleic acid is known, and the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected to produce the desired library size in accordance with the restriction site frequency of each enzyme in the sample nucleic acid sequence.
18. A method as claimed in claim 17 wherein the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii), are selected in accordance with the formula: $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k=x1}^{x2} \prod_{i} (1 - P_{i})^{k}$ wherein: $N_{x1 \sim x2}$ is the number of fragments with length between x1 and x2; k is fragment length; x1 and x2 are the lower and upper limits of the size range of the fragments in the library; $P_{i}$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.

19. A method as claimed in claim 17 wherein a representative partition of a particular region is produced in accordance with a restriction map of the sample nucleic acid sequence.

20. A method as claimed in claim 1 wherein the size of the sample nucleic acid is known, and the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected to produce the desired library size in accordance with an assumed restriction site frequency of each enzyme in the sample nucleic acid.

21. A method as claimed in claim 20 wherein the restriction site frequency within the sample is assumed based on sequence information from the sample.

22. A method as claimed in claim 20 wherein the restriction site frequency is assumed to be randomly distributed.

23. A method as claimed in claim 20 wherein the restriction site frequency is assumed based on the sequence information of the sample or is randomly distributed and the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii), are selected in accordance with the formula: $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k=x1}^{x2} \prod_{i} (1 - P_{i})^{k}$ wherein: $N_{x1 \sim x2}$ is the number of fragments with length between x1 and x2; k is fragment length; G is the size of the sample; x1 and x2 are the lower and upper limits of the size range of the fragments in the library; $P_{i}$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.

24. A method as claimed in claim 23 wherein the restriction enzymes used in step (i) are 4 and 6 nt cutting restriction enzymes, and are selected on the basis of the formula: $N' = 4^{-12} v' G \sum_{k=x1}^{x2} \left[ (1 - 1/4^{4})^{nk} (1 - 1/4^{6})^{(1+m)k} \right]$ wherein: k is fragment length; G is the size of the sample; x1 and x2 are the lower and upper limits of the size range of the fragments in the library; n is the number of extra 4 nt cutters; m is the number of extra 6 nt cutters.

25. A method as claimed in claim 20 wherein the size of the resulting library is estimated by the further steps of: (vii) sequencing the fragments in a fraction of the host cells in said library, (viii) estimating the size of the library using the formula: $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1) \pm s$ wherein: F is the estimated size of the library; n is the total number of sequences obtained by sequencing; $n_{i}$ is the number of sequences in the ith contig; s is the standard error.

26. A method as claimed in claim 25 wherein an optimised library is generated by the further steps of: (ix) providing a restriction site frequency for enzymes not used in step (i), optionally using the sequence information obtained at step (vii), (x) selecting further restriction enzymes on the basis of restriction site frequency to generate a desired size of partition using the formula $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k=x1}^{x2} \prod_{i} (1 - P_{i})^{k}$ wherein: $N_{x1 \sim x2}$ is the number of fragments with length between x1 and x2; k is fragment length; G is the size of the sample; x1 and x2 are the lower and upper limits of the size range of the fragments in the library; $P_{i}$ is the probability of having a restriction site at any given base for the ‘i’th enzyme, (xi) producing an optimised nucleic library in accordance with steps (i)-(vi) using at least one of these further restriction enzymes, (xii) optionally repeating steps (vii) to (xi) until the desired library size is obtained.
27. A method as claimed in claim 1 wherein the size of the sample nucleic acid is unknown, and the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected to produce the desired library size in accordance with an assumed restriction site frequency of each enzyme in the sample nucleic acid.

28. A method as claimed in claim 27 wherein the restriction site frequency within the sample is assumed based on sequence information from the sample.

29. A method as claimed in claim 28 wherein the restriction site frequency is assumed to be randomly distributed.

30. A method as claimed in claim 27 wherein the restriction site frequency within the sample is assumed based on the sequence information from the sample or is randomly distributed and three 4 nt- and one 6 nt-cutting restriction enzymes are used in step (i).

31. A method as claimed in claim 30 wherein HpaII, AluI, DraI, and PstI are used in step (i).

32. A method as claimed in claim 27 wherein the size of the resulting library is estimated by the further steps of: (vii) sequencing the fragments in a fraction of the host cells in said library, (viii) estimating the size of the library using the formula: $F = n(n-1)/\sum_{i} n_{i}(n_{i}-1) \pm s$ wherein: F is the estimated size of the library; n is the total number of sequences obtained by sequencing; $n_{i}$ is the number of sequences in the ith contig; s is the standard error.

33. A method as claimed in claim 32 wherein the size of the sample is estimated by the further steps of: (ix) providing the restriction site frequency of the enzymes used in step (i), optionally using the sequence information obtained at step (vii), (x) calculating the sample size G using the formula: $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k=x1}^{x2} \prod_{i} (1 - P_{i})^{k}$ wherein: $N_{x1 \sim x2}$ is the number of fragments with length between x1 and x2; k is fragment length; x1 and x2 are the lower and upper limits of the size range of the fragments in the library; $P_{i}$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.

34. A method as claimed in claim 33 wherein an optimised library is generated by the further steps of: (xi) providing a restriction site frequency for enzymes not used in step (i), optionally using the sequence information obtained at step (vii), (xii) selecting further restriction enzymes on the basis of restriction site frequency to generate a desired size of partition using the formula $N_{x1 \sim x2} = G P_{1}^{2} \sum_{k=x1}^{x2} \prod_{i} (1 - P_{i})^{k}$ wherein: $N_{x1 \sim x2}$ is the number of fragments with length between x1 and x2; k is fragment length; x1 and x2 are the lower and upper limits of the size range of the fragments in the library; $P_{i}$ is the probability of having a restriction site at any given base for the ‘i’th enzyme, (xiii) producing an optimised nucleic library in accordance with steps (i)-(vi) using at least one of these further restriction enzymes, (xiv) optionally repeating steps (vii) to (xiii) until the desired library size is obtained.
 35. A method as claimed in claim 1 wherein the samplenucleic acid comprises nucleic acid from two or more different sourceswhich are pooled to produce a library comprising fragments from each.36. A method for identifying a limited population of markers in a samplenucleic acid, which method comprises: (a) providing sample nucleic acidfrom at least two different sources, (b) providing a library containinga representative partition of the sample nucleic acid in accordance withclaim 1 to, (c) identifying differences within corresponding sequencesfrom said different sources contained within the library
 37. A method asclaimed in claim 36 wherein the two different nucleic sources are takenfrom different individuals.
 38. A method as claimed in claim 36 whereinthe markers are Single Nucleotide Polymorphisms.
 39. A method as claimedin claim 1 wherein the number of and type of the different restrictionenzymes used in step (i), and the sub-set of layers selected in step(iii) are selected in accordance with the output of program code run ona digital computer, which computer comprises a processor, a data storagesystem, at least one input device, and at least one output device, andwhich program code operates on the input of one or both of: (i) areference sequence or restriction map from the sample nucleic acid, (ii)a preference regarding partition size, and optionally preferred regionof the sample to include in the partition.
 40. A method as claimed in claim 39 wherein the program code includes a look up table including reference restriction site target sequences for different 4 and 6 nt cutting restriction enzymes.
 41. A method as claimed in claim 39 wherein the program code performs a function in accordance with a formula selected from the group consisting of $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error; and $N_{x_1 \sim x_2} = G P_{1}^{2} \sum_{k = x_1}^{k = x_2} \prod_{i = 1}^{i} (1 - P_{i})^{k}$ wherein: $N_{x_1 \sim x_2}$ is the number of fragments with length between $x_1$ and $x_2$, k is fragment length, $x_1$ and $x_2$ are upper and lower limits of the size range of the fragments in the library, and $P_i$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.
 42. A system for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 1, which system comprises program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 43. A system as claimed in claim 42 wherein the program code includes a look up table including reference restriction site target sequences for different 4 and 6 nt cutting restriction enzymes.
 44. A system as claimed in claim 43 wherein the program code performs a function in accordance with a formula selected from the group consisting of $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error; and $N_{x_1 \sim x_2} = G P_{1}^{2} \sum_{k = x_1}^{k = x_2} \prod_{i = 1}^{i} (1 - P_{i})^{k}$ wherein: $N_{x_1 \sim x_2}$ is the number of fragments with length between $x_1$ and $x_2$, k is fragment length, $x_1$ and $x_2$ are upper and lower limits of the size range of the fragments in the library, and $P_i$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.
 45. A computer program for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 1, which computer program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition, and wherein the program code includes a look up table including reference restriction site target sequences for different 4 and 6 nt cutting restriction enzymes, and wherein the program code performs a function in accordance with a formula selected from the group consisting of $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error; and $N_{x_1 \sim x_2} = G P_{1}^{2} \sum_{k = x_1}^{k = x_2} \prod_{i = 1}^{i} (1 - P_{i})^{k}$ wherein: $N_{x_1 \sim x_2}$ is the number of fragments with length between $x_1$ and $x_2$, k is fragment length, $x_1$ and $x_2$ are upper and lower limits of the size range of the fragments in the library, and $P_i$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.
 46. A computer program as claimed in claim 45 which is stored on a storage media or device readable by a general or special purpose programmable computer.
 47. A process for producing a chip for use in assaying a limited population of polymorphisms within a sample, which process comprises: (i) providing a population of probe sequences, which probe sequences are derived from a representative partition of sample nucleic acid provided in accordance with claim 1, and contain the population of polymorphisms, (ii) incorporating the probe sequences into the chip.
 48. A chip obtainable by the method of claim 47.
 49. A method of genotyping a nucleic acid sample from an individual, which method comprises: (i) providing the chip of claim 48, (ii) isolating a representative partition of sample nucleic acid from the individual in accordance with the method used to provide the representative partition containing the population of polymorphisms contained in the probe sequences, (iii) contacting the chip with the sample and determining hybridization of the sample nucleic acid thereto.
 50. A method as claimed in claim 4 wherein libraries are prepared separately using methylation sensitive and non-sensitive restriction enzymes, whereby comparison of the libraries permits methylation distribution patterns in the sample to be revealed.
 51. A method as claimed in claim 18 wherein a representative partition of a particular region is produced in accordance with a restriction map of the sample nucleic acid sequence.
 52. A method as claimed in claim 23 wherein the size of the resulting library is estimated by the further steps of: (vii) sequencing the fragments in a fraction of the host cells in said library, (viii) estimating the size of the library using formula: $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error.
 53. A method as claimed in claim 24 wherein the size of the resulting library is estimated by the further steps of: (vii) sequencing the fragments in a fraction of the host cells in said library, (viii) estimating the size of the library using formula: $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error.
 54. A method as claimed in claim 30 wherein the size of the resulting library is estimated by the further steps of: (vii) sequencing the fragments in a fraction of the host cells in said library, (viii) estimating the size of the library using formula: $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error.
 55. A method as claimed in claim 31 wherein the size of the resulting library is estimated by the further steps of: (vii) sequencing the fragments in a fraction of the host cells in said library, (viii) estimating the size of the library using formula: $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error.
 56. A method as claimed in claim 4 wherein the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected in accordance with the output of program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 57. A method as claimed in claim 12 wherein the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected in accordance with the output of program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 58. A method as claimed in claim 23 wherein the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) are selected in accordance with the output of program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 59. A system for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 4, which system comprises program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 60. A system for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 12, which system comprises program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 61. A system for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 23, which system comprises program code run on a digital computer, which computer comprises a processor, a data storage system, at least one input device, and at least one output device, and which program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition.
 62. A computer program for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 4, which computer program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition, and wherein the program code includes a look up table including reference restriction site target sequences for different 4 and 6 nt cutting restriction enzymes, and wherein the program code performs a function in accordance with a formula selected from the group consisting of $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error; and $N_{x_1 \sim x_2} = G P_{1}^{2} \sum_{k = x_1}^{k = x_2} \prod_{i = 1}^{i} (1 - P_{i})^{k}$ wherein: $N_{x_1 \sim x_2}$ is the number of fragments with length between $x_1$ and $x_2$, k is fragment length, $x_1$ and $x_2$ are upper and lower limits of the size range of the fragments in the library, and $P_i$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.
 63. A computer program for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 12, which computer program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition, and wherein the program code includes a look up table including reference restriction site target sequences for different 4 and 6 nt cutting restriction enzymes, and wherein the program code performs a function in accordance with a formula selected from the group consisting of $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error; and $N_{x_1 \sim x_2} = G P_{1}^{2} \sum_{k = x_1}^{k = x_2} \prod_{i = 1}^{i} (1 - P_{i})^{k}$ wherein: $N_{x_1 \sim x_2}$ is the number of fragments with length between $x_1$ and $x_2$, k is fragment length, $x_1$ and $x_2$ are upper and lower limits of the size range of the fragments in the library, and $P_i$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.
 64. A computer program for selecting the number of and type of the different restriction enzymes used in step (i), and the sub-set of layers selected in step (iii) of the method of claim 23, which computer program code operates on the input of one or both of: (i) a reference sequence or restriction map from the sample nucleic acid, (ii) a preference regarding partition size, and optionally preferred region of the sample to include in the partition, and wherein the program code includes a look up table including reference restriction site target sequences for different 4 and 6 nt cutting restriction enzymes, and wherein the program code performs a function in accordance with a formula selected from the group consisting of $F = \frac{n(n-1)}{\sum_{i} n_{i}(n_{i}-1)} \pm s$ wherein: F is the estimated size of the library, n is the total number of sequences obtained by sequencing, $n_i$ is the number of sequences in the ith contig, and s is the standard error; and $N_{x_1 \sim x_2} = G P_{1}^{2} \sum_{k = x_1}^{k = x_2} \prod_{i = 1}^{i} (1 - P_{i})^{k}$ wherein: $N_{x_1 \sim x_2}$ is the number of fragments with length between $x_1$ and $x_2$, k is fragment length, $x_1$ and $x_2$ are upper and lower limits of the size range of the fragments in the library, and $P_i$ is the probability of having a restriction site at any given base for the ‘i’th enzyme.
 65. A process for producing a chip for use in assaying a limited population of polymorphisms within a sample, which process comprises: (i) providing a population of probe sequences, which probe sequences are derived from a representative partition of sample nucleic acid provided in accordance with claim 4, and contain the population of polymorphisms, (ii) incorporating the probe sequences into the chip.
 66. A process for producing a chip for use in assaying a limited population of polymorphisms within a sample, which process comprises: (i) providing a population of probe sequences, which probe sequences are derived from a representative partition of sample nucleic acid provided in accordance with claim 12, and contain the population of polymorphisms, (ii) incorporating the probe sequences into the chip.
 67. A process for producing a chip for use in assaying a limited population of polymorphisms within a sample, which process comprises: (i) providing a population of probe sequences, which probe sequences are derived from a representative partition of sample nucleic acid provided in accordance with claim 23, and contain the population of polymorphisms, (ii) incorporating the probe sequences into the chip.
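By way of illustration only, the look up table and the two formulae recited for the program code in the foregoing claims might be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the applicants' implementation. The function names and the seven-enzyme table are hypothetical. The following are likewise assumptions: using an observed per-base site frequency as an estimate of Pi, reading G as the total length of the sample nucleic acid in bases, taking the product over every enzyme supplied, and counting a sequence that joins no other as a contig of size one. The standard error s is not computed, because the claims give no expression for it.

```python
from math import prod

# Hypothetical look up table: enzyme name -> recognition sequence.
# All sites listed are palindromic, so searching one strand is sufficient.
RESTRICTION_SITES = {
    "MboI": "GATC", "MseI": "TTAA", "TaqI": "TCGA", "AluI": "AGCT",  # 4 nt cutters
    "EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT",       # 6 nt cutters
}


def find_sites(sequence, site):
    """Return the 0-based start of every (possibly overlapping) occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def site_probability(sequence, site):
    """Observed sites per base (assumption: used as an estimate of P_i)."""
    if not sequence:
        return 0.0
    return len(find_sites(sequence, site)) / len(sequence)


def expected_fragments(G, p, x1, x2):
    """Evaluate N = G * P_1^2 * sum_{k=x1..x2} prod_i (1 - P_i)^k.

    G  : total length of the sample nucleic acid in bases (assumption)
    p  : list of per-base site probabilities, with p[0] taken as P_1
    x1 : smallest fragment length counted; x2 : largest fragment length counted
    """
    if not p or x1 > x2:
        raise ValueError("need at least one enzyme and x1 <= x2")
    total = sum(prod((1.0 - pi) ** k for pi in p) for k in range(x1, x2 + 1))
    return G * p[0] ** 2 * total


def estimate_library_size(contig_sizes):
    """Evaluate F = n(n-1) / sum_i n_i(n_i-1), omitting the standard error s.

    contig_sizes holds n_i, the number of sequences in each contig
    (assumption: a singleton sequence is entered as a contig of size 1).
    """
    n = sum(contig_sizes)
    denominator = sum(ni * (ni - 1) for ni in contig_sizes)
    if denominator == 0:
        raise ValueError("no contig holds two or more sequences; F is undefined")
    return n * (n - 1) / denominator
```

For instance, four sequenced clones falling into one contig of two sequences and two singletons give `estimate_library_size([2, 1, 1])`, that is 4×3/2 = 6. The estimate is undefined until at least one fragment has been sequenced twice, which is why the sketch raises an error rather than returning a value in that case.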